Churn Rate?

Why does a customer leave a telecom company? Most people leave a service because they are dissatisfied with the way they are treated; they would not be looking around if they were happy with their current provider, its service, and its employees.

Accenture reports that 77% of consumers are no longer as loyal to brands as they were even three years ago. Like everyone else, the telecom industry must work harder than ever at customer retention.

Load Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from datetime import datetime
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

Read data

In [2]:
customer_info = pd.read_csv('Customer_info.csv',na_values = 'N.A')
customer_info.rename(columns = {'type':'subcriber_type'},inplace = True)
# Dimensions
print("Number of Rows: {} \nNumber of Columns: {}".format(customer_info.shape[0],customer_info.shape[1]))
customer_info.head()
Number of Rows: 49998 
Number of Columns: 13
Out[2]:
name dob doj gender custid state state_id city county doorno zipcode phone subcriber_type
0 Clary Chantz 1964-06-10 2017-09-04 M 300129491 Minnesota MN Freeborn Freeborn 35 56032 2189087130 kinetic
1 Nakhi Gus 1941-04-09 2018-09-07 M 588672795 Nebraska NE Monroe Platte 5 68647 4026196486 kinetic
2 Marisela Thadeus 1981-12-29 2018-12-16 F 92138460 Kentucky KY De Mossville Pendleton 123 41033 6065720091 kinetic
3 Yazlin Able 1989-09-04 2018-05-07 F 255027674 South Carolina SC Fort Mill York 83 29708 8035019371 kinetic
4 Kagen Kunga 1920-07-21 2018-05-16 M 759549828 New York NY Fredonia Chautauqua 13 14063 7188912593 kinetic
In [3]:
customer_info.custid.nunique()
Out[3]:
49998

Postpaid usage

In [4]:
Postpaid_usage = pd.read_csv('Postpaid_usage.csv',na_values = 'N.A')
Postpaid_usage.rename(columns = {'date':'billing_date'},inplace = True)
# Dimensions
print("Number of Rows: {} \nNumber of Columns: {}".format(Postpaid_usage.shape[0],Postpaid_usage.shape[1]))
Postpaid_usage.head()
Number of Rows: 934203 
Number of Columns: 5
Out[4]:
custid billing_date amount plan status
0 300129491 2017-09-04 145 1_Gbps active
1 300129491 2017-10-04 60 1_Gbps active
2 300129491 2017-11-03 60 1_Gbps active
3 300129491 2017-12-03 60 1_Gbps active
4 300129491 2018-01-02 60 1_Gbps active
In [5]:
Postpaid_usage.custid.nunique()
Out[5]:
49790

Most recent bill date

In [6]:
postpaid_billdate = Postpaid_usage.groupby('custid')['billing_date'].apply(set).reset_index()
postpaid_billdate['billing_date'] = postpaid_billdate['billing_date'].apply(list)
recent_billdate = []
for billdate in postpaid_billdate['billing_date']:
    dates = sorted(billdate, key=lambda x: datetime.strptime(x,'%Y-%m-%d'))
    recent_billdate.append(dates[-1])
postpaid_billdate['last_billing_date'] = recent_billdate
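As an aside, ISO-8601 date strings ('YYYY-MM-DD') sort lexicographically in the same order as the dates they encode, so the sort-and-take-last loop above can be collapsed into a single groupby max. A minimal sketch on toy data (not the real CSV):

```python
import pandas as pd

# Toy stand-in for Postpaid_usage; ISO-8601 strings sort in date order
usage = pd.DataFrame({
    'custid': [1, 1, 2],
    'billing_date': ['2019-12-03', '2020-01-02', '2018-05-16'],
})
last_bill = (usage.groupby('custid')['billing_date'].max()
                  .reset_index(name='last_billing_date'))
print(last_bill)
```

The same trick works for any per-customer "most recent date" aggregation in this notebook.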
In [7]:
postpaid_status = Postpaid_usage.groupby('custid')['status'].apply(set).reset_index()
# each customer has exactly one status, so unwrap the single-element set
postpaid_status['status'] = postpaid_status['status'].apply(lambda x: str(x)[2:-2])
postpaid_status.head()
Out[7]:
custid status
0 10015596 active
1 10018000 active
2 10045793 active
3 10050587 active
4 10063782 active
In [8]:
postpaid_status.status.value_counts()
Out[8]:
active      38282
inactive    11508
Name: status, dtype: int64
In [9]:
postpaid_status.custid.nunique()
Out[9]:
49790
In [10]:
postpaid_amount = Postpaid_usage.groupby('custid')['amount'].sum().reset_index()
postpaid_amount.head()
Out[10]:
custid amount
0 10015596 1595
1 10018000 560
2 10045793 660
3 10050587 445
4 10063782 170
In [11]:
postpaid_amount.custid.nunique()
Out[11]:
49790
In [12]:
postpaid_plan = Postpaid_usage.groupby('custid')['plan'].apply(set).reset_index()
postpaid_plan['plan'] = postpaid_plan['plan'].apply(str)
postpaid_plan.head()
Out[12]:
custid plan
0 10015596 {'200_Mbps'}
1 10018000 {'25_Mbps'}
2 10045793 {'100_Mbps'}
3 10050587 {'200_Mbps'}
4 10063782 {'1_Gbps', '25_Mbps'}
In [13]:
postpaid_plan.plan.value_counts()
Out[13]:
{'500_Mbps'}                                        4802
{'300_Mbps'}                                        4776
{'1_Gbps'}                                          4730
{'200_Mbps'}                                        4716
{'100_Mbps'}                                        4707
                                                    ... 
{'1_Gbps', '500_Mbps', '100_Mbps', '300_Mbps'}        16
{'100_Mbps', '200_Mbps', '300_Mbps', '25_Mbps'}       15
{'500_Mbps', '100_Mbps', '300_Mbps', '25_Mbps'}       14
{'1_Gbps', '100_Mbps', '200_Mbps', '300_Mbps'}        14
{'200_Mbps', '500_Mbps', '300_Mbps', '100_Mbps'}       8
Name: plan, Length: 71, dtype: int64
In [14]:
postpaid_plan.custid.nunique()
Out[14]:
49790

Post journey

In [15]:
post_journey = pd.read_csv('post_journey.csv',na_values = 'N.A')
post_journey.rename(columns = {'date':'event_date','type':'event_type'},inplace = True)
# Dimensions
print("Number of Rows: {} \nNumber of Columns: {}".format(post_journey.shape[0],post_journey.shape[1]))
post_journey.head()
Number of Rows: 253541 
Number of Columns: 6
Out[15]:
event_date description event_type custid daystoresolve datecompleted
0 2017-03-31 new_installation order 10015596 NaN NaN
1 2018-12-27 others ticket 10015596 18.0 2019-01-14
2 2018-08-29 new_installation order 10018000 NaN NaN
3 2018-11-06 new_installation order 10045793 NaN NaN
4 2019-02-12 service_degradation_internet ticket 10045793 7.0 2019-02-19
In [16]:
post_journey.custid.nunique()
Out[16]:
48599
In [17]:
post_journey_event_date = post_journey.groupby('custid')['event_date'].apply(set).reset_index()
post_journey_event_date['event_date'] = post_journey_event_date['event_date'].apply(list)
post_journey_event_date.head()
Out[17]:
custid event_date
0 10015596 [2018-12-27, 2017-03-31]
1 10018000 [2018-08-29]
2 10045793 [2018-11-06, 2019-02-12]
3 10050587 [2019-09-24, 2019-05-18, 2019-11-21, 2019-08-07]
4 10063782 [2020-01-10, 2019-12-11, 2019-12-26]

last event date

In [18]:
last_eventdate = []
for eventdate in post_journey_event_date['event_date']:
    dates = sorted(eventdate, key=lambda x: datetime.strptime(x,'%Y-%m-%d'))
    last_eventdate.append(dates[-1])
post_journey_event_date['last_event_date'] = last_eventdate
In [19]:
post_journey_event_date.head()
Out[19]:
custid event_date last_event_date
0 10015596 [2018-12-27, 2017-03-31] 2018-12-27
1 10018000 [2018-08-29] 2018-08-29
2 10045793 [2018-11-06, 2019-02-12] 2019-02-12
3 10050587 [2019-09-24, 2019-05-18, 2019-11-21, 2019-08-07] 2019-11-21
4 10063782 [2020-01-10, 2019-12-11, 2019-12-26] 2020-01-10
In [20]:
post_journey_event_type = post_journey.groupby('custid')['event_type'].apply(set).reset_index()
post_journey_event_type['event_type'] = post_journey_event_type['event_type'].apply(list)
post_journey_event_type.head()
Out[20]:
custid event_type
0 10015596 [ticket, order]
1 10018000 [order]
2 10045793 [ticket, order]
3 10050587 [ticket, order]
4 10063782 [ticket, order]
In [21]:
post_journey_desc = post_journey.groupby('custid')['description'].apply(set).reset_index()
post_journey_desc['description'] = post_journey_desc['description'].apply(list)
post_journey_desc.head()
Out[21]:
custid description
0 10015596 [others, new_installation]
1 10018000 [new_installation]
2 10045793 [service_degradation_internet, new_installation]
3 10050587 [service_degradation_internet, new_installatio...
4 10063782 [plan_change, new_installation, service_outage...

days to resolve

In [22]:
postpaid_daystoresolve = post_journey.groupby('custid')['daystoresolve'].sum().reset_index()
postpaid_daystoresolve.head()
Out[22]:
custid daystoresolve
0 10015596 18.0
1 10018000 0.0
2 10045793 7.0
3 10050587 17.0
4 10063782 5.0

Merging all the dataframes

In [23]:
# compile the list of dataframes you want to merge
data_frames = [customer_info, postpaid_billdate, postpaid_status, postpaid_amount,postpaid_plan,
              post_journey_event_date,post_journey_event_type,post_journey_desc,postpaid_daystoresolve]
In [24]:
from functools import reduce
df_merged = reduce(lambda  left,right: pd.merge(left,right,on=['custid'],
                                            how='outer'), data_frames).fillna(np.nan)
df_merged.custid.nunique()
Out[24]:
49998
In [25]:
# Dimensions
print("Number of Rows: {} \nNumber of Columns: {}".format(df_merged.shape[0],df_merged.shape[1]))
df_merged.head()
Number of Rows: 49998 
Number of Columns: 23
Out[25]:
name dob doj gender custid state state_id city county doorno ... billing_date last_billing_date status amount plan event_date last_event_date event_type description daystoresolve
0 Clary Chantz 1964-06-10 2017-09-04 M 300129491 Minnesota MN Freeborn Freeborn 35 ... [2019-07-26, 2019-03-28, 2019-10-24, 2018-07-0... 2020-01-22 active 1970.0 {'1_Gbps'} [2019-10-17, 2018-12-27, 2018-10-09, 2017-09-0... 2019-10-17 [ticket, order] [service_outage_internet, service_degradation_... 24.0
1 Nakhi Gus 1941-04-09 2018-09-07 M 588672795 Nebraska NE Monroe Platte 5 ... [2019-06-04, 2018-12-06, 2018-10-07, 2019-04-0... 2020-01-30 active 830.0 {'200_Mbps'} [2019-02-28, 2018-11-22, 2018-11-30, 2018-09-0... 2019-11-09 [ticket, order] [new_installation, service_outage_voice_call, ... 31.0
2 Marisela Thadeus 1981-12-29 2018-12-16 F 92138460 Kentucky KY De Mossville Pendleton 123 ... [2019-05-15, 2019-10-12, 2019-09-12, 2019-12-1... 2019-12-11 inactive 670.0 {'1_Gbps', '200_Mbps', '100_Mbps'} [2019-02-28, 2019-04-09, 2019-12-11, 2019-04-3... 2019-12-11 [ticket, order] [service_outage_internet, service_degradation_... 123.0
3 Yazlin Able 1989-09-04 2018-05-07 F 255027674 South Carolina SC Fort Mill York 83 ... [2019-11-28, 2019-07-31, 2019-07-01, 2018-08-0... 2019-11-28 inactive 725.0 {'1_Gbps', '25_Mbps'} [2019-09-04, 2018-08-05, 2018-05-07, 2019-11-28] 2019-11-28 [ticket, order] [plan_change, new_installation, disconnect, se... 1.0
4 Kagen Kunga 1920-07-21 2018-05-16 M 759549828 New York NY Fredonia Chautauqua 13 ... [2018-08-14, 2018-10-13, 2019-07-10, 2019-01-1... 2019-12-07 inactive 620.0 {'25_Mbps'} [2019-07-27, 2019-07-13, 2019-08-11, 2019-03-1... 2019-12-07 [ticket, order] [service_degradation_voice_call, suspend, new_... 84.0

5 rows × 23 columns

Number of days between two event dates

In [26]:
def days_between(d1, d2):
    d1 = datetime.strptime(d1, "%Y-%m-%d")
    d2 = datetime.strptime(d2, "%Y-%m-%d")
    return abs((d2 - d1).days)

days_btw_events = []
for date in df_merged['event_date']:
    try:
        dates = sorted(date, key=lambda x: datetime.strptime(x, '%Y-%m-%d'))
        days_btw_events.append(days_between(dates[0], dates[-1]))
    except TypeError:
        # event_date is NaN for customers with no journey records
        days_btw_events.append(0)
    
    
df_merged['event_date_gap'] = days_btw_events
df_merged.head()
Out[26]:
name dob doj gender custid state state_id city county doorno ... last_billing_date status amount plan event_date last_event_date event_type description daystoresolve event_date_gap
0 Clary Chantz 1964-06-10 2017-09-04 M 300129491 Minnesota MN Freeborn Freeborn 35 ... 2020-01-22 active 1970.0 {'1_Gbps'} [2019-10-17, 2018-12-27, 2018-10-09, 2017-09-0... 2019-10-17 [ticket, order] [service_outage_internet, service_degradation_... 24.0 773
1 Nakhi Gus 1941-04-09 2018-09-07 M 588672795 Nebraska NE Monroe Platte 5 ... 2020-01-30 active 830.0 {'200_Mbps'} [2019-02-28, 2018-11-22, 2018-11-30, 2018-09-0... 2019-11-09 [ticket, order] [new_installation, service_outage_voice_call, ... 31.0 428
2 Marisela Thadeus 1981-12-29 2018-12-16 F 92138460 Kentucky KY De Mossville Pendleton 123 ... 2019-12-11 inactive 670.0 {'1_Gbps', '200_Mbps', '100_Mbps'} [2019-02-28, 2019-04-09, 2019-12-11, 2019-04-3... 2019-12-11 [ticket, order] [service_outage_internet, service_degradation_... 123.0 360
3 Yazlin Able 1989-09-04 2018-05-07 F 255027674 South Carolina SC Fort Mill York 83 ... 2019-11-28 inactive 725.0 {'1_Gbps', '25_Mbps'} [2019-09-04, 2018-08-05, 2018-05-07, 2019-11-28] 2019-11-28 [ticket, order] [plan_change, new_installation, disconnect, se... 1.0 570
4 Kagen Kunga 1920-07-21 2018-05-16 M 759549828 New York NY Fredonia Chautauqua 13 ... 2019-12-07 inactive 620.0 {'25_Mbps'} [2019-07-27, 2019-07-13, 2019-08-11, 2019-03-1... 2019-12-07 [ticket, order] [service_degradation_voice_call, suspend, new_... 84.0 570

5 rows × 24 columns

In [27]:
df_merged.describe()
Out[27]:
custid doorno zipcode phone amount daystoresolve event_date_gap
count 4.999800e+04 49998.000000 49998.000000 4.999800e+04 49790.000000 48599.000000 49998.000000
mean 5.058391e+08 62.382755 45093.511680 5.794116e+09 919.467664 27.103953 409.118765
std 2.855759e+08 35.713082 22750.522219 2.355261e+09 487.163964 24.755458 267.307751
min 1.001560e+07 1.000000 6390.000000 2.050012e+09 110.000000 0.000000 0.000000
25% 2.599269e+08 32.000000 28420.500000 3.524786e+09 560.000000 7.000000 210.000000
50% 5.054546e+08 62.000000 42374.000000 5.738927e+09 870.000000 22.000000 390.000000
75% 7.536277e+08 93.000000 68038.000000 7.868804e+09 1220.000000 40.000000 589.000000
max 9.999808e+08 124.000000 88436.000000 9.849979e+09 3530.000000 212.000000 1650.000000

Missing values?

In [28]:
df_merged.isnull().sum()
Out[28]:
name                    0
dob                     0
doj                     0
gender                  0
custid                  0
state                   0
state_id                0
city                    0
county                  0
doorno                  0
zipcode                 0
phone                   0
subcriber_type          0
billing_date          208
last_billing_date     208
status                208
amount                208
plan                  208
event_date           1399
last_event_date      1399
event_type           1399
description          1399
daystoresolve        1399
event_date_gap          0
dtype: int64

We have no postpaid status for 208 customers, so we drop those 208 rows. The remaining event-related gaps are filled instead: 'None' for event_type and description, 0 for daystoresolve.

In [29]:
df_merged.dropna(subset = ['status'],inplace = True)
df_merged = df_merged.reset_index(drop = True)
df_merged['event_type'].fillna('None',inplace = True)
df_merged['description'].fillna('None',inplace = True)
df_merged['daystoresolve'].fillna(0,inplace = True)
df_merged.isnull().sum()
Out[29]:
name                    0
dob                     0
doj                     0
gender                  0
custid                  0
state                   0
state_id                0
city                    0
county                  0
doorno                  0
zipcode                 0
phone                   0
subcriber_type          0
billing_date            0
last_billing_date       0
status                  0
amount                  0
plan                    0
event_date           1191
last_event_date      1191
event_type              0
description             0
daystoresolve           0
event_date_gap          0
dtype: int64
In [30]:
df_merged.shape
Out[30]:
(49790, 24)

Memory handling

In [31]:
# credits: The function was taken from https://www.kaggle.com/artgor/brute-force-feature-engineering 
def reduce_memory(df, verbose=True):
    """Function to reduce memory size of attributes"""
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Memory usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df
In [32]:
# Before Memory reduction
print("Total memory used before Memory reduction {:5.2f}Mb".format(df_merged.memory_usage().sum() / 1024**2))
Total memory used before Memory reduction  9.12Mb
In [33]:
# After Memory reduction
df_merged = reduce_memory(df_merged)
print("Total memory used after Memory reduction {:5.2f}Mb".format(df_merged.memory_usage().sum() / 1024**2))
Memory usage decreased to  7.55 Mb (17.2% reduction)
Total memory used after Memory reduction  7.55Mb

Average Monthly Usage

For this, let's use the Postpaid_usage data.

In [34]:
Postpaid_usage['billing_date'] = pd.to_datetime(Postpaid_usage['billing_date'])
Postpaid_usage.set_index('billing_date',inplace = True)
In [35]:
# Average monthly usage (average monthly amount)
Postpaid_usage['amount'].resample('M').agg('mean')
Out[35]:
billing_date
2015-05-31    125.000000
2015-06-30     40.000000
2015-07-31    103.333333
2015-08-31     92.142857
2015-09-30     73.000000
2015-10-31     73.437500
2015-11-30     68.333333
2015-12-31     71.515152
2016-01-31     71.470588
2016-02-29     71.319444
2016-03-31     61.980198
2016-04-30     60.440000
2016-05-31     65.432432
2016-06-30     67.044534
2016-07-31     61.309524
2016-08-31     65.363436
2016-09-30     62.029617
2016-10-31     62.993590
2016-11-30     62.023327
2016-12-31     60.597826
2017-01-31     60.235330
2017-02-28     58.244296
2017-03-31     60.423661
2017-04-30     59.015328
2017-05-31     57.947840
2017-06-30     57.398603
2017-07-31     56.820898
2017-08-31     56.016975
2017-09-30     54.942088
2017-10-31     54.463264
2017-11-30     54.221476
2017-12-31     53.873139
2018-01-31     53.221182
2018-02-28     52.407719
2018-03-31     52.248212
2018-04-30     51.761889
2018-05-31     51.212806
2018-06-30     50.973831
2018-07-31     50.649407
2018-08-31     49.762932
2018-09-30     49.640266
2018-10-31     49.240326
2018-11-30     48.743828
2018-12-31     48.573081
2019-01-31     48.291888
2019-02-28     47.932041
2019-03-31     47.859855
2019-04-30     47.573786
2019-05-31     47.349154
2019-06-30     47.129026
2019-07-31     47.222393
2019-08-31     46.931952
2019-09-30     46.892413
2019-10-31     46.918679
2019-11-30     46.778077
2019-12-31     48.119956
2020-01-31     48.152758
2020-02-29     47.780008
Freq: M, Name: amount, dtype: float64

Exploratory Data Analysis - EDA

In [36]:
import plotly.graph_objects as go

labels = ['Exited','Continued']
values =  [df_merged.status[df_merged['status']=='inactive'].count(), df_merged.status[df_merged['status']=='active'].count()]
colors = ['red', 'darkorange']
fig = go.Figure(data=[go.Pie(labels=labels, values=values)])
fig.update_traces(hole=.4, hoverinfo='label+value',  textfont_size=20,
                  marker=dict(colors=colors, line=dict(color='#000000', width=2)))
fig.update_layout(
    title_text="Ratio of Customer Churned and Retained")
fig.show()
In [37]:
df_merged.status.value_counts()
Out[37]:
active      38282
inactive    11508
Name: status, dtype: int64

~23% of customers have left the telecom company. We cannot use accuracy as our only metric: even if we predicted that every customer stays, we would still score ~77% accuracy. Our focus should be on the churn rate, i.e., on the minority class 'inactive'.
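To make this concrete, a quick sketch using the class counts above shows how well the trivial "everyone stays" classifier would score, and why accuracy alone is misleading:

```python
# Class counts from the value_counts above
active, inactive = 38282, 11508
total = active + inactive

baseline_accuracy = active / total   # classifier that predicts "stays" for everyone
churn_recall = 0 / inactive          # ...yet it catches zero churners
print(f"baseline accuracy: {baseline_accuracy:.1%}, churn recall: {churn_recall:.0%}")
```

This is why metrics such as recall, precision, or F1 on the 'inactive' class matter more here than raw accuracy.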

Replace active customers with 0 and inactive customers with 1

In [38]:
df_merged['status'] = df_merged['status'].map({'inactive':1,'active':0})

Region wise customer count

In [39]:
region_cust = pd.DataFrame(df_merged.county.value_counts()).reset_index()
region_cust.rename(columns = {'county':'count','index':'county'},inplace = True)
region_cust
Out[39]:
county count
0 Jefferson 785
1 Washington 600
2 Montgomery 566
3 Jackson 471
4 Franklin 459
... ... ...
1089 Hooker 2
1090 Irwin 2
1091 Green 1
1092 Kinney 1
1093 Sutton 1

1094 rows × 2 columns

Jefferson (785) and Washington (600) have the most customers, whereas Green, Kinney, and Sutton each have only 1.

Top 10 regions

In [40]:
top_regions = region_cust['county'][0:10].tolist()
top_regions
Out[40]:
['Jefferson',
 'Washington',
 'Montgomery',
 'Jackson',
 'Franklin',
 'Harris',
 'Monroe',
 'Polk',
 'Dallas',
 'Erie']
In [41]:
region_df = df_merged[df_merged['county'].isin(top_regions)].reset_index(drop = True)
region_df.head()
Out[41]:
name dob doj gender custid state state_id city county doorno ... last_billing_date status amount plan event_date last_event_date event_type description daystoresolve event_date_gap
0 Conner Yvon 1994-04-01 2020-01-06 M 680210839 Ohio OH Columbus Franklin 104 ... 2020-02-05 0 135.0 {'25_Mbps'} [2020-01-10, 2020-01-27, 2020-01-06, 2020-01-3... 2020-01-30 [ticket, order] [service_degradation_internet, new_installatio... 23.0 24
1 Dayan Oluwasemilore 1970-10-07 2017-11-08 M 779075713 Alabama AL Birmingham Jefferson 15 ... 2020-01-27 0 1500.0 {'100_Mbps', '500_Mbps'} [2019-10-29, 2017-11-08, 2019-02-20] 2019-10-29 [ticket, order] [plan_change, service_outage_internet, new_ins... 0.0 720
2 Isabell Antonino 1976-09-11 2016-12-11 F 285273849 Alabama AL Saint Stephens Washington 84 ... 2020-01-25 0 1580.0 {'100_Mbps'} [2019-06-06, 2017-09-02, 2016-12-11] 2019-06-06 [ticket, order] [new_installation, service_outage_voice_call, ... 59.0 907
3 Jamile Hyman 1963-01-27 2017-10-07 M 954090742 Missouri MO Saint Albans Franklin 64 ... 2020-01-25 0 1100.0 {'200_Mbps', '25_Mbps'} [2018-12-01, 2018-07-10, 2019-08-30, 2019-03-0... 2019-11-15 [ticket, order] [service_outage_internet, service_degradation_... 42.0 769
4 Kimberlyann Klint 1948-09-19 2018-09-12 F 253435537 Florida FL Eastpoint Franklin 12 ... 2020-01-05 1 700.0 {'100_Mbps'} [2019-03-02, 2020-01-05, 2018-09-12] 2020-01-05 [ticket, order] [disconnect, new_installation, service_outage_... 7.0 480

5 rows × 24 columns

In [42]:
region_df['county'].value_counts()
Out[42]:
Jefferson     785
Washington    600
Montgomery    566
Jackson       471
Franklin      459
Harris        425
Monroe        407
Polk          372
Dallas        354
Erie          337
Name: county, dtype: int64
In [43]:
# top 10 county vs status(Churn)
sns.catplot(x='county', y='status', data=region_df, kind='point', height=4, aspect=3)

fig, (axis1,axis2,axis3) = plt.subplots(1,3,figsize=(15,5))


sns.countplot(x='county', data=region_df, ax=axis1)
sns.countplot(x='status', hue="county", data=region_df, order=[1,0], ax=axis2)

# group by Geography, and get the mean for Churned customers for each value in Geography
geography_perc = region_df[["county", "status"]].groupby(['county'],as_index=False).mean()
geography_perc.columns = ['county', 'Mean(status)']
sns.barplot(x='county', y='Mean(status)', data=geography_perc,order=['Jefferson',
 'Washington',
 'Montgomery',
 'Jackson',
 'Franklin',
 'Harris',
 'Monroe',
 'Polk',
 'Dallas',
 'Erie'],ax=axis3)
del geography_perc

From the above graph we can observe that customers from Harris, Monroe, and Polk have a slightly higher average churn rate (~0.25), whilst Jackson has the lowest (~0.18) among the 10 most frequent counties.

Amount

In [44]:
# Amount
# peaks for Exited/not exited customers by their amount
facet = sns.FacetGrid(df_merged, hue="status",aspect=4)
facet.map(sns.kdeplot,'amount',shade= True)
facet.set(xlim=(0, df_merged['amount'].max()))
facet.add_legend()

# average churn rate by amount
fig, axis1 = plt.subplots(1,1,figsize=(18,4))
average_amount = df_merged[["amount", "status"]].groupby(['amount'],as_index=False).mean()
average_amount.columns = ['amount','Mean(status)']
sns.barplot(x='amount', y='Mean(status)', data=average_amount)
del average_amount
In [45]:
# average amount for active customers
avg_active = np.mean(df_merged['amount'][df_merged['status'] == 0].values)
print('Average amount for active customers: ',avg_active)
# average amount for innactive customers
avg_inactive = np.mean(df_merged['amount'][df_merged['status'] == 1].values)
print('Average amount for inactive customers: ',avg_inactive)
Average amount for active customers:  953.0
Average amount for inactive customers:  807.5

The difference is not that significant. Also, the data is imbalanced, so any conclusion has to be drawn keeping the ratio of active vs. inactive customers in mind.
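Given the imbalance, it is safer to compare full per-group summaries than bare means. A sketch on toy data (the amounts are invented, chosen only so the group means reproduce the 953.0 and 807.5 reported above):

```python
import pandas as pd

# Toy amounts; means match the active/inactive averages printed above
toy = pd.DataFrame({
    'status': [0, 0, 0, 1, 1],
    'amount': [900.0, 1000.0, 959.0, 800.0, 815.0],
})
summary = toy.groupby('status')['amount'].agg(['count', 'mean', 'median', 'std'])
print(summary)
```

In the notebook the same one-liner on `df_merged` (grouped by `status`) surfaces spread and sample size alongside the means.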

Most widely used plan in each zip code

In [46]:
plan_zipcode =  df_merged.groupby('zipcode')['plan'].apply(lambda x: x.mode().iloc[0]).reset_index()
plan_zipcode
Out[46]:
zipcode plan
0 6390 {'200_Mbps'}
1 10001 {'300_Mbps'}
2 10002 {'100_Mbps', '300_Mbps'}
3 10003 {'1_Gbps', '100_Mbps', '300_Mbps'}
4 10004 {'25_Mbps'}
... ... ...
15685 88430 {'200_Mbps'}
15686 88431 {'100_Mbps', '300_Mbps'}
15687 88434 {'100_Mbps'}
15688 88435 {'300_Mbps'}
15689 88436 {'1_Gbps'}

15690 rows × 2 columns

In [47]:
# check - for zipcode = 88436 is 1_Gbps plan the most used?
a = df_merged[df_merged['zipcode'] == 88436].reset_index(drop = True)
a['plan'].value_counts()
Out[47]:
{'1_Gbps'}      2
{'200_Mbps'}    1
{'500_Mbps'}    1
Name: plan, dtype: int64

Yes: for zipcode 88436, the 1_Gbps plan is the most used (2 customers).

In [48]:
plan_status =  df_merged.groupby('status')['plan'].apply(lambda x: x.mode().iloc[0]).reset_index()
plan_status
Out[48]:
status plan
0 0 {'1_Gbps'}
1 1 {'200_Mbps'}

Most churned customers used the 200_Mbps plan, though the margin over the other single plans is slim (1187 vs. 1167 for 500_Mbps).

In [49]:
# Check 
a = df_merged[df_merged['status'] == 1].reset_index(drop = True)
a['plan'].value_counts()
Out[49]:
{'200_Mbps'}                                        1187
{'500_Mbps'}                                        1167
{'25_Mbps'}                                         1149
{'300_Mbps'}                                        1138
{'1_Gbps'}                                          1088
                                                    ... 
{'500_Mbps', '100_Mbps', '300_Mbps', '25_Mbps'}        3
{'1_Gbps', '300_Mbps', '100_Mbps', '500_Mbps'}         3
{'1_Gbps', '100_Mbps', '200_Mbps', '300_Mbps'}         3
{'100_Mbps', '200_Mbps', '300_Mbps', '500_Mbps'}       3
{'200_Mbps', '500_Mbps', '300_Mbps', '100_Mbps'}       2
Name: plan, Length: 71, dtype: int64

Gender

Who is more loyal?

In [50]:
# Gender vs Churn
f,ax=plt.subplots(1,2,figsize=(18,10))
df_merged[['gender','status']].groupby(['gender']).mean().plot.bar(ax=ax[0])
ax[0].set_title('Churn vs Gender')
sns.countplot(x='gender', hue='status', data=df_merged, ax=ax[1])
ax[1].set_title('Gender:Churned vs Retained')
Out[50]:
Text(0.5, 1.0, 'Gender:Churned vs Retained')
In [51]:
print(df_merged[['gender','status']].groupby(['gender']).mean())
print()
print("# Female - Exited: ",len(df_merged[(df_merged['status'] == 1) & (df_merged['gender'] == 'F')]))
print("# Female - Retained: ",len(df_merged[(df_merged['status'] == 0) & (df_merged['gender'] == 'F')]))
print()
print("# Male - Exited: ",len(df_merged[(df_merged['status'] == 1) & (df_merged['gender'] == 'M')]))
print("# Male - Retained: ",len(df_merged[(df_merged['status'] == 0) & (df_merged['gender'] == 'M')]))
          status
gender          
F       0.232917
M       0.229345

# Female - Exited:  5798
# Female - Retained:  19095

# Male - Exited:  5710
# Male - Retained:  19187

Male and female customers churn at almost the same rate (~23% each).

Subscriber Type?

In [52]:
customer_info['subcriber_type'].value_counts()
Out[52]:
kinetic    49998
Name: subcriber_type, dtype: int64

subcriber_type is a static column containing only one value, 'kinetic'. Let's drop it.
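More generally, constant columns can be found programmatically before dropping them. A small sketch on a hypothetical mini-frame (`nunique` counts distinct values per column):

```python
import pandas as pd

# Hypothetical mini-frame: one constant column, one informative column
df = pd.DataFrame({
    'subcriber_type': ['kinetic'] * 4,
    'status': [0, 1, 0, 1],
})
# columns with at most one distinct value carry no signal
static_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
df = df.drop(columns=static_cols)
print(static_cols, list(df.columns))
```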

In [53]:
df_merged.drop('subcriber_type',axis = 1,inplace = True)
In [54]:
df_merged.columns
Out[54]:
Index(['name', 'dob', 'doj', 'gender', 'custid', 'state', 'state_id', 'city',
       'county', 'doorno', 'zipcode', 'phone', 'billing_date',
       'last_billing_date', 'status', 'amount', 'plan', 'event_date',
       'last_event_date', 'event_type', 'description', 'daystoresolve',
       'event_date_gap'],
      dtype='object')

Description of churned customers

In [55]:
desc_status =  df_merged.groupby('status')['description'].apply(lambda x: x.mode().iloc[0]).reset_index()
desc_status
Out[55]:
status description
0 0 [new_installation, service_outage_voice_call, ...
1 1 [new_installation, disconnect]
In [56]:
# Check 
a = df_merged[df_merged['status'] == 1].reset_index(drop = True)
a['description'].value_counts()
Out[56]:
[new_installation, disconnect]                                                                                                                               797
[disconnect, new_installation, service_outage_voice_call, service_degradation_voice_call]                                                                    796
[service_degradation_voice_call, new_installation, disconnect, plan_change, service_outage_voice_call]                                                       601
[plan_change, new_installation, disconnect]                                                                                                                  562
[service_outage_internet, service_degradation_voice_call, new_installation, disconnect, service_outage_voice_call]                                           461
                                                                                                                                                            ... 
[new_installation, disconnect, no_train, plan_change, order]                                                                                                   1
[service_outage_internet, location_move, service_degradation_voice_call, new_installation, others, disconnect, service_degradation_internet, plan_change]      1
[service_outage_internet, new_installation, disconnect, service_degradation_internet, plan_change, service_outage_voice_call, order]                           1
[service_outage_internet, suspend, new_installation, others, disconnect, restore, service_outage_voice_call]                                                   1
[service_outage_internet, network_groom, new_installation, disconnect, service_degradation_internet]                                                           1
Name: description, Length: 466, dtype: int64

[new_installation, disconnect] is the most common description set among churned customers (797), and disconnect appears in all of the top sets.

Daystoresolve vs Churn

In [57]:
# daystoresolve
y0 = df_merged.daystoresolve[df_merged.status == 0].values
y1 = df_merged.daystoresolve[df_merged.status == 1].values

fig = go.Figure()
fig.add_trace(go.Box(y=y0, name='Continued',
                marker_color = 'blue'))
fig.add_trace(go.Box(y=y1, name = 'Exited',
                marker_color = 'red'))

fig.update_layout(
    yaxis_title='Days to resolve'
)

fig.show()
In [58]:
# daystoresolve
# peaks for Exited/not exited customers by their daystoresolve
facet = sns.FacetGrid(df_merged, hue="status",aspect=4)
facet.map(sns.kdeplot,'daystoresolve',shade= True)
facet.set(xlim=(0, df_merged['daystoresolve'].max()))
facet.add_legend()

# average churn rate by daystoresolve
fig, axis1 = plt.subplots(1,1,figsize=(18,4))
average_daystoresolve = df_merged[["daystoresolve", "status"]].groupby(['daystoresolve'],as_index=False).mean()
average_daystoresolve.columns = ['daystoresolve','Mean(status)']
sns.barplot(x='daystoresolve', y='Mean(status)', data=average_daystoresolve)
del average_daystoresolve

No real difference in days to resolve between churned and active customers

Feature Engineering

Age of the customer

In [59]:
df_merged['dob'] = pd.to_datetime(df_merged['dob'])
df_merged['age'] = (pd.to_datetime('today').year - pd.to_datetime(df_merged['dob']).dt.year)
df_merged['age']
Out[59]:
0         56
1         79
2         39
3         31
4        100
        ... 
49785     56
49786     57
49787     77
49788     28
49789     82
Name: age, Length: 49790, dtype: int64
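Note that the year-only subtraction above can overstate age by one year for customers whose birthday has not yet occurred in the reference year. A hedged sketch of a birthday-aware alternative (the function name and dates are illustrative, not from the notebook):

```python
import pandas as pd

def exact_age(dob, ref):
    """Age in completed years at the reference date (birthday-aware)."""
    dob, ref = pd.to_datetime(dob), pd.to_datetime(ref)
    # subtract 1 if the reference (month, day) falls before the birthday
    return ref.year - dob.year - ((ref.month, ref.day) < (dob.month, dob.day))
```

For a coarse feature like age this refinement may not change the model, but it avoids a systematic off-by-one for part of the customer base.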
In [60]:
# age
y0 = df_merged.age[df_merged.status == 0].values
y1 = df_merged.age[df_merged.status == 1].values

fig = go.Figure()
fig.add_trace(go.Box(y=y0, name='Continued',
                marker_color = 'blue'))
fig.add_trace(go.Box(y=y1, name = 'Exited',
                marker_color = 'red'))

fig.update_layout(
    yaxis_title='Age'
)

fig.show()

There is no noticeable difference in age between active and churned customers.

Service -- Years since doj (date of joining)

In [61]:
#df_merged['doj'] = pd.to_datetime(df_merged['doj'])
df_merged['service'] = (pd.to_datetime('today').year - pd.to_datetime(df_merged['doj']).dt.year)
df_merged['service']
Out[61]:
0        3
1        2
2        2
3        2
4        2
        ..
49785    2
49786    1
49787    2
49788    2
49789    2
Name: service, Length: 49790, dtype: int64
In [62]:
# Service
# peaks for Exited/not exited customers by their service
facet = sns.FacetGrid(df_merged, hue="status",aspect=4)
facet.map(sns.kdeplot,'service',shade= True)
facet.set(xlim=(0, df_merged['service'].max()))
facet.add_legend()

# average churn rate by service
fig, axis1 = plt.subplots(1,1,figsize=(18,4))
average_service = df_merged[["service", "status"]].groupby(['service'],as_index=False).mean()
average_service.columns = ['service','Mean(status)']
sns.barplot(x='service', y='Mean(status)', data=average_service)
del average_service

Customers with 1 year of service since their date of joining have the highest mean churn rate (~0.33).

Log of amount

In [63]:
df_merged['log_amount'] = np.log10(df_merged['amount'])
df_merged.tail()
Out[63]:
name dob doj gender custid state state_id city county doorno ... plan event_date last_event_date event_type description daystoresolve event_date_gap age service log_amount
49785 Margia Giovan 1964-07-14 2018-10-13 F 296262210 Iowa IA Ames Story 74 ... {'1_Gbps', '200_Mbps', '100_Mbps', '500_Mbps'} [2019-04-14, 2018-10-13, 2019-01-11, 2018-11-1... 2019-11-07 [ticket, order] [plan_change, new_installation, no_train, serv... 20.0 390 56 2 2.888672
49786 Blake Naftoli 1963-02-26 2019-03-12 M 202345517 Florida FL Altoona Lake 117 ... {'300_Mbps', '500_Mbps'} [2020-01-01, 2019-03-12, 2019-05-07, 2019-09-1... 2020-01-01 [ticket, order] [service_outage_internet, service_degradation_... 59.0 295 57 1 2.822266
49787 Reggie Salvador 1943-09-19 2018-02-09 M 969087102 Alabama AL Coker Tuscaloosa 47 ... {'1_Gbps', '300_Mbps', '25_Mbps'} [2018-09-15, 2018-09-07, 2018-12-06, 2018-02-09] 2018-12-06 [ticket, order] [plan_change, new_installation, service_degrad... 3.0 300 77 2 3.031250
49788 Zera Yu 1992-01-28 2018-01-20 F 902249378 Kentucky KY Roark Leslie 22 ... {'200_Mbps', '300_Mbps', '100_Mbps'} [2019-05-23, 2018-01-20, 2019-01-15, 2018-05-1... 2019-05-23 [ticket, order] [service_degradation_voice_call, others, new_i... 16.0 488 28 2 3.035156
49789 Jackey Timmie 1938-03-25 2018-08-07 M 59395124 Pennsylvania PA Hyndman Bedford 58 ... {'100_Mbps'} [2019-01-04, 2019-12-19, 2019-09-26, 2018-08-0... 2019-12-22 [ticket, order] [new_installation, service_outage_voice_call, ... 31.0 502 82 2 2.892578

5 rows × 26 columns

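A toy illustration (assumed values, not from the dataset) of what the log10 transform above does: a right-skewed amount column spanning several orders of magnitude is compressed onto a narrow, comparable scale.

```python
import numpy as np

# amounts spanning four orders of magnitude collapse to the range [1, 4]
amounts = np.array([10.0, 100.0, 1000.0, 10000.0])
log_amounts = np.log10(amounts)  # -> [1.0, 2.0, 3.0, 4.0]
```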
In [64]:
# log Amount
# peaks for Exited/not exited customers by their log amount
facet = sns.FacetGrid(df_merged, hue="status",aspect=4)
facet.map(sns.kdeplot,'log_amount',shade= True)
facet.set(xlim=(0, df_merged['log_amount'].max()))
facet.add_legend()

# average churn rate by log_amount
fig, axis1 = plt.subplots(1,1,figsize=(18,4))
average_logamount = df_merged[["log_amount", "status"]].groupby(['log_amount'],as_index=False).mean()
average_logamount.columns = ['log_amount','Mean(status)']
sns.barplot(x='log_amount', y='Mean(status)', data=average_logamount)
del average_logamount

Drop columns!

In [65]:
drop_cols = ['name','custid','dob','doj','state_id','doorno','zipcode','phone','billing_date','last_billing_date','event_date',
       'last_event_date','log_amount','description']
df = df_merged.drop(drop_cols,axis = 1)
# Dimensions
print("Number of Rows: {} \nNumber of Columns: {}".format(df.shape[0],df.shape[1]))
df.tail()
Number of Rows: 49790 
Number of Columns: 12
Out[65]:
gender state city county status amount plan event_type daystoresolve event_date_gap age service
49785 F Iowa Ames Story 0 775.0 {'1_Gbps', '200_Mbps', '100_Mbps', '500_Mbps'} [ticket, order] 20.0 390 56 2
49786 M Florida Altoona Lake 0 665.0 {'300_Mbps', '500_Mbps'} [ticket, order] 59.0 295 57 1
49787 M Alabama Coker Tuscaloosa 0 1075.0 {'1_Gbps', '300_Mbps', '25_Mbps'} [ticket, order] 3.0 300 77 2
49788 F Kentucky Roark Leslie 0 1085.0 {'200_Mbps', '300_Mbps', '100_Mbps'} [ticket, order] 16.0 488 28 2
49789 M Pennsylvania Hyndman Bedford 0 780.0 {'100_Mbps'} [ticket, order] 31.0 502 82 2

Correlation check!

In [66]:
cordata = df.corr(method ='pearson')
cordata.style.background_gradient(cmap='summer')
Out[66]:
status amount daystoresolve event_date_gap age service
status 1.000000 -0.126022 -0.005532 0.103525 0.000925 -0.076548
amount -0.126022 1.000000 0.035934 0.730841 0.000847 0.846600
daystoresolve -0.005532 0.035934 1.000000 0.196088 0.007663 0.046562
event_date_gap 0.103525 0.730841 0.196088 1.000000 -0.004951 0.759297
age 0.000925 0.000847 0.007663 -0.004951 1.000000 -0.005975
service -0.076548 0.846600 0.046562 0.759297 -0.005975 1.000000

None of the features is highly correlated with the target (status).

We can see that some independent features are correlated among themselves (amount & service, event_date_gap & service, event_date_gap & amount), but multicollinearity mainly affects coefficient interpretability rather than predictive performance, so it is not an issue for the scikit-learn tree-based models used here.

Source: https://www.linkedin.com/posts/justmarkham_sklearntips-machinelearning-python-activity-6651812212270788609-lhE1/
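A quick toy check (assumed data, not from the notebook) of that claim: adding a perfectly collinear feature, here an exact copy of an existing column, leaves a decision tree's predictions unchanged, since either copy induces the same splits.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(17)
X = rng.rand(200, 2)
y = (X[:, 0] + X[:, 1] > 1).astype(int)
X_dup = np.hstack([X, X[:, :1]])  # duplicate the first column

# same seed, same data apart from the redundant column
p_base = DecisionTreeClassifier(random_state=0).fit(X, y).predict(X)
p_dup = DecisionTreeClassifier(random_state=0).fit(X_dup, y).predict(X_dup)
```

For linear models with interpretable coefficients the story would be different; there, correlated predictors inflate coefficient variance.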

In [67]:
## Convert column formats
df['event_type'] = df['event_type'].apply(str)
#df['description'] = df['description'].apply(str)

Train, Validation and Test Split

Let's split the data into Train, CV and Test sets.

In [68]:
# dependent variable
y = df['status'].values
# independent variables
X = df.drop(['status'], axis=1)
X.head(2) 
Out[68]:
gender state city county amount plan event_type daystoresolve event_date_gap age service
0 M Minnesota Freeborn Freeborn 1970.0 {'1_Gbps'} ['ticket', 'order'] 24.0 773 56 3
1 M Nebraska Monroe Platte 830.0 {'200_Mbps'} ['ticket', 'order'] 31.0 428 79 2
In [69]:
# train cv test split - stratified sampling
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y,random_state = 17)
X_train, X_cv, y_train, y_cv = train_test_split(X_train, y_train, test_size=0.20, stratify=y_train,random_state = 17)

# reset index
X_train = X_train.reset_index(drop = True)
X_cv = X_cv.reset_index(drop = True)
X_test = X_test.reset_index(drop = True)

# Save Train CV and test set
X_train.to_csv('Train_churn.csv',index = False)
X_test.to_csv('Test_churn.csv',index = False)
X_cv.to_csv('CV_churn.csv',index = False)

print("train dimensions: ",X_train.shape, y_train.shape)
print("cv dimensions: ",X_cv.shape, y_cv.shape)
print("test dimensions: ",X_test.shape, y_test.shape)
train dimensions:  (31865, 11) (31865,)
cv dimensions:  (7967, 11) (7967,)
test dimensions:  (9958, 11) (9958,)

Data Preprocessing

Preparing the data for model building: encoding categorical features and scaling numerical features.

Normalize numerical data

The features are on different scales/units. Features like amount span a much wider range of values than age or service, so we standardise the numerical features before feeding them into our models.

In [70]:
from sklearn.preprocessing import StandardScaler
# features to standardise
cols_norm = ['amount','daystoresolve','event_date_gap','age','service']

sc = StandardScaler()
sc.fit(X_train[cols_norm]) # fit has to happen only on train set

X_train[cols_norm] = sc.transform(X_train[cols_norm])
X_cv[cols_norm] = sc.transform(X_cv[cols_norm])
X_test[cols_norm] = sc.transform(X_test[cols_norm])

print("Standardized!")
X_train.head()
Standardized!
Out[70]:
gender state city county amount plan event_type daystoresolve event_date_gap age service
0 M Iowa Norway Benton -1.510424 {'500_Mbps'} ['order'] -1.066462 -1.545463 -0.942667 -1.108508
1 F Pennsylvania Ludlow McKean 0.290804 {'1_Gbps'} ['order'] -1.066462 -1.545463 -0.319624 0.098607
2 F Minnesota Bertha Todd 0.187877 {'200_Mbps'} ['ticket', 'order'] -0.905665 0.382387 -0.415477 0.098607
3 F Georgia Blythe Richmond -0.923738 {'100_Mbps'} ['ticket', 'order'] -0.865466 -0.415863 0.351345 -1.108508
4 F Florida Pensacola Escambia 1.248028 {'100_Mbps', '300_Mbps', '500_Mbps'} ['ticket', 'order'] -0.423275 1.504455 -1.230225 1.305723

Label Encoding Categorical data

In [71]:
# Encoding Gender
X_train['gender'] = X_train['gender'].apply(lambda x: 1 if x == 'M' else 0)
X_cv['gender'] = X_cv['gender'].apply(lambda x: 1 if x == 'M' else 0)
X_test['gender'] = X_test['gender'].apply(lambda x: 1 if x == 'M' else 0)

# Encoding state
from sklearn.preprocessing import LabelEncoder
le_state = LabelEncoder()
le_state.fit(df['state'])
X_train['state'] = le_state.transform(X_train['state'])
X_cv['state'] = le_state.transform(X_cv['state'])
X_test['state'] = le_state.transform(X_test['state'])

# Encoding city
from sklearn.preprocessing import LabelEncoder
le_city = LabelEncoder()
le_city.fit(df['city'])
X_train['city'] = le_city.transform(X_train['city'])
X_cv['city'] = le_city.transform(X_cv['city'])
X_test['city'] = le_city.transform(X_test['city'])

# Encoding county
from sklearn.preprocessing import LabelEncoder
le_county = LabelEncoder()
le_county.fit(df['county'])
X_train['county'] = le_county.transform(X_train['county'])
X_cv['county'] = le_county.transform(X_cv['county'])
X_test['county'] = le_county.transform(X_test['county'])


# Encoding plan
from sklearn.preprocessing import LabelEncoder
le_plan = LabelEncoder()
le_plan.fit(df['plan'])
X_train['plan'] = le_plan.transform(X_train['plan'])
X_cv['plan'] = le_plan.transform(X_cv['plan'])
X_test['plan'] = le_plan.transform(X_test['plan'])


# Encoding event_type
from sklearn.preprocessing import LabelEncoder
le_event_type = LabelEncoder()
le_event_type.fit(df['event_type'])
X_train['event_type'] = le_event_type.transform(X_train['event_type'])
X_cv['event_type'] = le_event_type.transform(X_cv['event_type'])
X_test['event_type'] = le_event_type.transform(X_test['event_type'])


print('Encoded!')
X_train.head()
Encoded!
Out[71]:
gender state city county amount plan event_type daystoresolve event_date_gap age service
0 1 4 6234 80 -1.510424 70 1 -1.066462 -1.545463 -0.942667 -1.108508
1 0 15 5050 647 0.290804 40 1 -1.066462 -1.545463 -0.319624 0.098607
2 0 6 660 984 0.187877 58 2 -0.905665 0.382387 -0.415477 0.098607
3 0 3 818 840 -0.923738 10 2 -0.865466 -0.415863 0.351345 -1.108508
4 0 2 6670 349 1.248028 6 2 -0.423275 1.504455 -1.230225 1.305723

Model

Assuming all of the above X features will be available at prediction time (live/production), we proceed to build our models.

Decision Tree

In [72]:
def batch_predict(clf, data):
    # roc_auc_score(y_true, y_score) the 2nd parameter should be probability estimates of the positive class
    # not the predicted outputs

    y_data_pred = []
    tr_loop = data.shape[0] - data.shape[0]%1000
    # e.g. if your X_tr shape is 49041, then tr_loop will be 49041 - 49041%1000 = 49000
    # in this for loop we iterate up to the last multiple of 1000
    for i in range(0, tr_loop, 1000):
        y_data_pred.extend(clf.predict_proba(data[i:i+1000])[:,1]) # Predict proba - Predicts probability scores
    # we will be predicting for the last data points
    if data.shape[0]%1000 !=0:
        y_data_pred.extend(clf.predict_proba(data[tr_loop:])[:,1])
    
    return y_data_pred
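A quick sanity check (plain arithmetic, using the example numbers from the comment above) of how the chunking splits the rows:

```python
n = 49041
tr_loop = n - n % 1000               # 49000: largest multiple of 1000 <= n
chunks = list(range(0, tr_loop, 1000))  # 49 full chunks of 1000 rows
remainder = n % 1000                 # 41 rows left for the final predict_proba call
```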
In [73]:
def uniform_random_sample(start,end,size=10):
    """
    Generate `size` unique integers (default 10), drawn uniformly at random
    from the half-open range [start, end)
    """
    random_int = []
    while len(random_int) != size:
        a = np.random.randint(start,end)
        if a not in random_int:
            random_int.append(a)
    random_int.sort()
    return random_int
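An equivalent one-call sketch using np.random.choice with replace=False, which draws unique indices directly instead of the rejection loop above (the seed is an assumption, added only for reproducibility):

```python
import numpy as np

np.random.seed(17)  # assumed seed for reproducibility
idx = sorted(np.random.choice(np.arange(0, 16), size=5, replace=False).tolist())
```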
In [74]:
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
from tqdm import tqdm

train_auc = []
cv_auc = []
depths = [1, 5, 10, 50] # list of various max depth values we want to compare.
min_samples_splits = [5,10,100,500] # list of various min_samples_split values we want to compare.

hyperparams = [(d,m) for d in depths for m in min_samples_splits]
hyperparameter_indices = uniform_random_sample(0,len(hyperparams))
params_list = [hyperparams[i] for i in hyperparameter_indices]
depths_ls = [i[0] for i in params_list]
min_samples_splits_ls = [i[1] for i in params_list]

for i in tqdm(params_list):
    # initialize DT Model with max_depth = i[0] and min_samples_split = i[1]
    dt = DecisionTreeClassifier(max_depth=i[0],min_samples_split=i[1],random_state = 20) 
    dt.fit(X_train, y_train) # fit the Decision Tree model on the train data
    y_train_pred = batch_predict(dt, X_train) # Predict on the train data    
    y_cv_pred = batch_predict(dt, X_cv) # Predict on cross validation data
    # roc_auc_score(y_true, y_score) the 2nd parameter should be probability estimates of the positive class
    # not the predicted outputs        
    train_auc.append(roc_auc_score(y_train,y_train_pred))
    cv_auc.append(roc_auc_score(y_cv, y_cv_pred))
100%|██████████| 10/10 [00:02<00:00,  4.76it/s]
In [75]:
import plotly.offline as offline
import plotly.graph_objs as go
offline.init_notebook_mode()

x1 = min_samples_splits_ls
y1 = depths_ls
z1 = train_auc

x2 = min_samples_splits_ls
y2 = depths_ls
z2 = cv_auc

# https://plot.ly/python/3d-axes/
trace1 = go.Scatter3d(x=x1,y=y1,z=z1, name = 'train auc')
trace2 = go.Scatter3d(x=x2,y=y2,z=z2, name = 'cv auc')
data = [trace1, trace2]

layout = go.Layout(scene = dict(
        xaxis = dict(title='min_samples_split'),
        yaxis = dict(title='max_depth'),
        zaxis = dict(title='AUC'),))

fig = go.Figure(data=data, layout=layout)
offline.iplot(fig, filename='3d-scatter-colorscale')
In [76]:
# best hyperparameters
best_depth = 50
best_min_samples_split = 100

Fitting the best model

In [77]:
best_dt = DecisionTreeClassifier(max_depth = best_depth, min_samples_split=best_min_samples_split,random_state = 20)
best_dt.fit(X_train,y_train)
Out[77]:
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=50, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=100,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=20, splitter='best')
In [78]:
from sklearn.metrics import roc_curve, auc

# roc_auc_score(y_true, y_score) the 2nd parameter should be probability estimates of the positive class
# not the predicted outputs

y_train_pred = batch_predict(best_dt, X_train)    
y_test_pred = batch_predict(best_dt, X_test)

train_fpr, train_tpr, tr_thresholds = roc_curve(y_train, y_train_pred)
test_fpr, test_tpr, te_thresholds = roc_curve(y_test, y_test_pred)

plt.plot(train_fpr, train_tpr, label="Train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="Test AUC ="+str(auc(test_fpr, test_tpr)))
plt.legend()
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.title("ROC AUC Curve")
plt.grid()
plt.show()
In [79]:
print('Decision Tree --> Train AUC: ',auc(train_fpr, train_tpr))
print('Decision Tree --> Test AUC: ',auc(test_fpr, test_tpr))
Decision Tree --> Train AUC:  0.958225404214639
Decision Tree --> Test AUC:  0.9411992502090318

The gap between the train and test AUC scores is small and both scores are high, so the model generalises well.

Confusion Matrix -- Decision Tree Model

In [80]:
# pick the threshold that maximises tpr*(1-fpr), i.e. high tpr with low fpr
import numpy as np
def find_best_threshold(threshold, fpr, tpr):
    t = threshold[np.argmax(tpr*(1-fpr))]
    # (tpr*(1-fpr)) will be maximum if your fpr is very low and tpr is very high
    print("The maximum value of tpr*(1-fpr)", max(tpr*(1-fpr)), "for threshold", np.round(t,3))
    return t

def predict_with_best_t(proba, threshold):
    predictions = []
    for i in proba:
        if i>=threshold:
            predictions.append(1)
        else:
            predictions.append(0)
    return predictions
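A toy walk-through (assumed scores, not notebook data) of the rule in find_best_threshold: with perfectly separable scores, the threshold maximising tpr*(1-fpr) lands at the lowest positive-class score, where tpr = 1 and fpr = 0.

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 0, 1, 1, 1])
y_prob = np.array([0.10, 0.40, 0.35, 0.80, 0.65, 0.70])
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
best_t = thresholds[np.argmax(tpr * (1 - fpr))]  # -> 0.65
```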
In [81]:
from sklearn.metrics import confusion_matrix
best_t = find_best_threshold(tr_thresholds, train_fpr, train_tpr)
cm_te = confusion_matrix(y_test, predict_with_best_t(y_test_pred, best_t))
import seaborn as sns
# Reference: https://seaborn.pydata.org/generated/seaborn.heatmap.html
# Confusion matrix on test data
print()
print("Confusion matrix: Test data")
ax= plt.subplot()
sns.heatmap(cm_te,annot=True,fmt='1g',cbar = False,cmap = 'YlOrRd')
ax.set_xlabel('Predicted Class Labels');
ax.set_ylabel('Actual Class Labels');
ax.set_title('Confusion-Matrix: Test');
plt.show()
The maximum value of tpr*(1-fpr) 0.8031392881388808 for threshold 0.256

Confusion matrix: Test data
In [82]:
## Classification report
from sklearn.metrics import classification_report
cr = classification_report(y_test,predict_with_best_t(y_test_pred, best_t))
print(cr)
              precision    recall  f1-score   support

           0       0.97      0.83      0.90      7656
           1       0.63      0.93      0.75      2302

    accuracy                           0.85      9958
   macro avg       0.80      0.88      0.82      9958
weighted avg       0.89      0.85      0.86      9958

Our precision for the positive class (1 --> Churned/Exited) is only ~63%, which needs improvement. However, the f1-score is reasonable for the positive class.
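One lever for the low positive-class precision, sketched on toy probabilities (assumed values): raising the decision threshold trades recall for precision, so the operating point can be tuned to the business cost of false churn alerts.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([0, 0, 0, 0, 1, 1])
probs = np.array([0.20, 0.40, 0.60, 0.30, 0.55, 0.90])

low = (probs >= 0.35).astype(int)   # lenient threshold: more churn alerts
high = (probs >= 0.65).astype(int)  # stricter threshold: fewer, surer alerts
p_low, r_low = precision_score(y_true, low), recall_score(y_true, low)
p_high, r_high = precision_score(y_true, high), recall_score(y_true, high)
```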

XGBoost

In [83]:
import xgboost as xgb

train_auc = []
cv_auc = []
n_trees = [50,100,200,300,500,750,1000] # list of various n_estimator values we want to compare.
l_rate = [0.0001,0.001,0.01,0.1,1.0,10] # list of various learning rate values we want to compare.

hyperparams = [(n,l) for n in n_trees for l in l_rate]
hyperparameter_indices = uniform_random_sample(0,len(hyperparams),size = 15)
params_list = [hyperparams[i] for i in hyperparameter_indices]
n_trees_ls = [i[0] for i in params_list]
l_rate_ls = [i[1] for i in params_list]

for i in tqdm(params_list):
    # initialize XGBoost Model with n_estimators = i[0] and learning_rate = i[1]
    xg = xgb.XGBClassifier(n_estimators=i[0],learning_rate=i[1],random_state = 17) 
    xg.fit(X_train, y_train) # fit the Xgboost model on the train data
    y_train_pred = batch_predict(xg, X_train) # Predict on the train data    
    y_cv_pred = batch_predict(xg, X_cv) # Predict on cross validation data
    # roc_auc_score(y_true, y_score) the 2nd parameter should be probability estimates of the positive class
    # not the predicted outputs        
    train_auc.append(roc_auc_score(y_train,y_train_pred))
    cv_auc.append(roc_auc_score(y_cv, y_cv_pred))
100%|██████████| 15/15 [02:56<00:00, 11.76s/it]
In [84]:
import plotly.offline as offline
import plotly.graph_objs as go
offline.init_notebook_mode()

x1 = l_rate_ls
y1 = n_trees_ls
z1 = train_auc

x2 = l_rate_ls
y2 = n_trees_ls
z2 = cv_auc

# https://plot.ly/python/3d-axes/
trace1 = go.Scatter3d(x=x1,y=y1,z=z1, name = 'train auc')
trace2 = go.Scatter3d(x=x2,y=y2,z=z2, name = 'cv auc')
data = [trace1, trace2]

layout = go.Layout(scene = dict(
        xaxis = dict(title='Learning Rate'),
        yaxis = dict(title='n_estimators'),
        zaxis = dict(title='AUC'),))

fig = go.Figure(data=data, layout=layout)
offline.iplot(fig, filename='3d-scatter-colorscale')
In [85]:
# best hyperparameters
best_n_estimator = 750
best_l_rate = 0.1
In [86]:
best_xgb = xgb.XGBClassifier(n_estimators=best_n_estimator,learning_rate=best_l_rate,random_state = 17)
best_xgb.fit(X_train,y_train)
Out[86]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.1, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=750, n_jobs=0, num_parallel_tree=1,
              objective='binary:logistic', random_state=17, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)
In [87]:
from sklearn.metrics import roc_curve, auc

# roc_auc_score(y_true, y_score) the 2nd parameter should be probability estimates of the positive class
# not the predicted outputs

y_train_pred = batch_predict(best_xgb, X_train)    
y_test_pred = batch_predict(best_xgb, X_test)

train_fpr, train_tpr, tr_thresholds = roc_curve(y_train, y_train_pred)
test_fpr, test_tpr, te_thresholds = roc_curve(y_test, y_test_pred)

plt.plot(train_fpr, train_tpr, label="Train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="Test AUC ="+str(auc(test_fpr, test_tpr)))
plt.legend()
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.title("ROC AUC Curve")
plt.grid()
plt.show()
In [88]:
print('XGBoost --> Train AUC: ',auc(train_fpr, train_tpr))
print('XGBoost --> Test AUC: ',auc(test_fpr, test_tpr))
XGBoost --> Train AUC:  0.9992089003422143
XGBoost --> Test AUC:  0.975728025332567

Confusion Matrix -- XGBoost Model

In [89]:
best_t = find_best_threshold(tr_thresholds, train_fpr, train_tpr)
cm_te = confusion_matrix(y_test, predict_with_best_t(y_test_pred, best_t))
import seaborn as sns
# Reference: https://seaborn.pydata.org/generated/seaborn.heatmap.html
# Confusion matrix on test data
print()
print("Confusion matrix: Test data")
ax= plt.subplot()
sns.heatmap(cm_te,annot=True,fmt='1g',cbar = False,cmap = 'YlOrRd')
ax.set_xlabel('Predicted Class Labels');
ax.set_ylabel('Actual Class Labels');
ax.set_title('Confusion-Matrix: Test');
plt.show()
The maximum value of tpr*(1-fpr) 0.9736728874849329 for threshold 0.452

Confusion matrix: Test data
In [90]:
## Classification report -- XGBoost Model
from sklearn.metrics import classification_report
cr = classification_report(y_test,predict_with_best_t(y_test_pred, best_t))
print('XGBoost Model:\n',cr)
XGBoost Model:
               precision    recall  f1-score   support

           0       0.95      0.94      0.95      7656
           1       0.81      0.84      0.82      2302

    accuracy                           0.92      9958
   macro avg       0.88      0.89      0.88      9958
weighted avg       0.92      0.92      0.92      9958

XGBoost has performed well: f1-score, precision and recall are all good.

Feature importance

In [91]:
# feature importance
fe_imp = pd.DataFrame({'Feature':X_train.columns.tolist(),'Importance':best_xgb.feature_importances_})
print(fe_imp.sort_values(by=['Importance'], ascending=False))
import plotly.express as px
fig = px.bar(fe_imp, x='Feature', y='Importance')
fig.show()
           Feature  Importance
10         service    0.315895
8   event_date_gap    0.212255
6       event_type    0.156469
4           amount    0.099427
5             plan    0.078014
7    daystoresolve    0.031080
3           county    0.022496
2             city    0.022378
1            state    0.021947
9              age    0.021472
0           gender    0.018567

Summary

In [92]:
from prettytable import PrettyTable

pt = PrettyTable(['Model','Hyper parameter','Auc on testset'])
pt.add_row(['Decision Tree','best_depth = 50, best_min_samples_split = 100',0.94])
pt.add_row(['XGBoost','n_estimator = 750, learning_rate = 0.1',0.97])
print(pt)
+---------------+-----------------------------------------------+----------------+
|     Model     |                Hyper parameter                | Auc on testset |
+---------------+-----------------------------------------------+----------------+
| Decision Tree | best_depth = 50, best_min_samples_split = 100 |      0.94      |
|    XGBoost    |     n_estimator = 750, learning_rate = 0.1    |      0.97      |
+---------------+-----------------------------------------------+----------------+

Assuming that all the features used during model building will be available for the test set (live/production), the models do a good job of predicting a customer's churn risk.

service, event_date_gap and event_type were the three most important features in the XGBoost model.

Save files for Deployment

In [93]:
import pickle
# Save model
with open("xgbfiles.pickle","wb") as f:
    pickle.dump(sc, f)
    pickle.dump(le_state, f)
    pickle.dump(le_city, f)
    pickle.dump(le_county, f)
    pickle.dump(le_plan, f)
    pickle.dump(le_event_type, f)
    pickle.dump(best_xgb, f)

print("Files Saved!")

"""
# we should read them in the same order as we dumped.
with open("xgbfiles.pickle", "rb") as f:
    sc = pickle.load(f)
    le_state = pickle.load(f)
    le_city = pickle.load(f)
    le_county = pickle.load(f)
    le_plan = pickle.load(f)
    le_event_type = pickle.load(f)
    best_xgb = pickle.load(f)
"""
Files Saved!
Out[93]:
'\n# we should read them in the same order as we dumped.\nwith open("clf_lr.pickle", "rb") as f:\n    sc = pickle.load(f)\n    le_state = pickle.load(f)\n    le_city = pickle.load(f)\n    le_county = pickle.load(f)\n    le_plan = pickle.load(f)\n    le_event_type = pickle.load(f)\n    best_xgb = pickle.load(f)\n'